Abstract

Regularization is a well studied problem in the context of neural networks. It is
usually used to improve the generalization performance when the number of input
samples is relatively small or heavily contaminated with noise. The regularization
of a parametric model can be achieved in different manners, some of which are
early stopping (Morgan and Bourlard 1990), weight decay and output smoothing, all of
which are used to avoid overfitting during the training of the considered model. From
a Bayesian point of view, many regularization techniques correspond to imposing
certain prior distributions on model parameters (Krogh and Hertz 1991). Using
Bishop's approximation (Bishop 1995) of the objective function when a restricted
type of noise is added to the input of a parametric function, we derive the higher
order terms of the Taylor expansion and analyze the coefficients of the regularization
terms induced by the noisy input. In particular we study the effect of penalizing the
Hessian of the mapping function with respect to the input in terms of generalization
performance. We also show how we can control this coefficient independently by
explicitly penalizing the Jacobian of the mapping function on corrupted inputs.

1 Introduction 

Regularization is a well studied problem in the context of neural networks. It is usually
used to improve the generalization performance when the number of input samples is
relatively small or heavily contaminated with noise. The regularization of a parametric
model can be achieved in different manners, some of which are early stopping (Morgan and
Bourlard 1990), weight decay or output smoothing, all of which are used to avoid overfitting
during the training of the considered model. From a Bayesian point of view, many
regularization techniques correspond to imposing certain prior distributions on model
parameters (Krogh and Hertz 1991).



In this paper we propose a novel approach to achieve regularization that combines noise
in the input and explicit output smoothing by regularizing the L2-norm of the Jacobian
of the mapping function with respect to the input. Bishop (1995) has proved that the
two approaches are essentially equivalent under some assumptions, using a Taylor
approximation up to the second order of the noisy objective function. Using his theoretical
analysis, we derive the approximation of our cost function in the weak noise limit and
show the advantage of our technique from both the theoretical and the empirical point of
view. In particular, we show that we achieve a better smoothing of the output of the
considered model with little computational overhead.



2 Definitions 



For ease of readability, most of our analysis involves only vectors and matrices, except
for section 6, where it was not possible to avoid using tensor objects. Our analysis also
assumes that the model's output is scalar, which avoids the use of tensors for the
low order terms of the Taylor expansion. We will use the following notations:

• $\langle \cdot , \cdot \rangle$ : inner product,

• $\otimes$ : tensor product,

• $J_F(x)$, $H_F(x)$, $T^n_F(x)$ : respectively the Jacobian, Hessian and n-th order
derivative of $F$ with respect to vector $x$.

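We consider the following set of points:

$$\mathcal{D}_n = \{ z_i = (x_i, y_i) \in (\mathbb{R}^d, \mathbb{R}) \mid \forall i \in [1;n] \}$$

where the $(x_i, y_i)$ are the (input, target) pairs of an arbitrary dataset. In the paper we
will consider a particular family of parametric models $\mathcal{F}$ whose members $F_\theta$
satisfy $F_\theta(\mathbb{R}^d) \subset \mathbb{R}$. We define the expected cost under the true
distribution $p(z)$ of our data points as being:

$$C(\theta) = \int \mathcal{L}(z, \theta)\, p(z)\, dz \tag{1}$$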



The expected empirical cost when using the data without noise can be expressed as:

$$C_{clean}(\theta) = \frac{1}{n} \sum_i \int \mathcal{L}(z, \theta)\, \delta(z_i - z)\, dz = \frac{1}{n} \sum_i \mathcal{L}(z_i, \theta) \tag{2}$$

where $\delta$ is the Dirac function. When adding noise to the input, we will consider $p(z)$ as
being a Parzen density estimator:

$$p(z) = \frac{1}{n} \sum_{i=1}^{n} \psi(z_i - z) \tag{3}$$

with the kernels $\psi$ centered on the points of our dataset. In the rest of this paper we
will consider kernels for which the following assumptions hold:

(a) every kernel has zero mean,

(b) different components of a kernel are independent.

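Note that the normal and uniform distributions have these properties. Using (a), we can
write (An 1996):

$$\int \varepsilon_i\, \psi(\varepsilon)\, d\varepsilon = 0 \tag{4}$$

and using (b):

$$\int \varepsilon_i \varepsilon_j\, \psi(\varepsilon)\, d\varepsilon = \sigma^2 \delta_{ij} \tag{5}$$

where $\sigma^2$ is the variance of the distribution $\psi$, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_d)$.
In our analysis we will restrict ourselves to Gaussian kernels:

$$\psi(z_i - z) = \mathcal{N}_{z_i, \sigma^2}(z) \tag{6}$$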

Substituting (6) in (3) we can write the objective function with noisy input as being:

$$C_{noisy}(\theta) = \frac{1}{n} \sum_i \int \mathcal{L}(z, \theta)\, \mathcal{N}_{z_i, \sigma^2}(z)\, dz \tag{7}$$
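The noisy objective can also be estimated by sampling rather than by integration. The following is only a minimal sketch of such a Monte Carlo estimate, using the Python standard library. The toy dataset, the fixed linear model inside `toy_loss` and the function names are assumptions made for illustration and do not come from our experiments.

```python
import random

def clean_objective(data, loss):
    """Empirical cost on the uncorrupted points (average loss over the dataset)."""
    return sum(loss(z) for z in data) / len(data)

def noisy_objective(data, loss, sigma, n_draws=2000, seed=0):
    """Monte Carlo estimate of the noisy cost: each point z_i is replaced by
    n_draws samples from an isotropic Gaussian kernel centered on z_i."""
    rng = random.Random(seed)
    total = 0.0
    for z in data:
        for _ in range(n_draws):
            z_noisy = [zk + rng.gauss(0.0, sigma) for zk in z]
            total += loss(z_noisy)
    return total / (len(data) * n_draws)

def toy_loss(z):
    """Hypothetical loss on z = (x, y): squared error of the fixed model F(x) = 2x."""
    x, y = z
    return (2.0 * x - y) ** 2

data = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2)]
c_clean = clean_objective(data, toy_loss)
c_noisy = noisy_objective(data, toy_loss, sigma=0.1)
```

For this quadratic toy loss the gap between the two costs can be computed by hand: the noise on $x$ contributes $4\sigma^2$ and the noise on $y$ contributes $\sigma^2$, so the noisy cost should be close to the clean cost plus $5\sigma^2$.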

3 Penalty term induced by noisy input 



Bishop (1995) already showed that tuning the parameters of a model with corrupted
inputs is asymptotically equivalent to minimizing the true error function when
simultaneously decreasing the level of corruption to zero as the number of corrupted inputs
tends to infinity. He also showed that adding noise to the input is equivalent to
minimizing a different objective function that includes one or more penalty terms, and he
uses a Taylor expansion to derive an analytic approximation of the noise induced penalty.
Using the above assumptions we can write:

$$C_{noisy}(\theta) = C_{clean}(\theta) + \phi(\theta) \tag{8}$$

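where $\phi$ is the penalty term. Substituting (7) and (2) in (8) we express the penalty term
as being:

$$\phi(\theta) = \frac{1}{n} \sum_i \left[ \int \mathcal{L}(z, \theta)\, \mathcal{N}_{z_i, \sigma^2}(z)\, dz - \mathcal{L}(z_i, \theta) \right] \tag{9}$$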



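We define the noise vector as being $\varepsilon = z - z_i$ and, omitting $\theta$ for simplicity,
we can write for all $i$ the term inside the sum of (9):

$$D = \int \mathcal{L}(z_i + \varepsilon)\, \mathcal{N}_{0, \sigma^2}(\varepsilon)\, d\varepsilon - \mathcal{L}(z_i) \tag{10}$$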






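Now that we have identified the term to approximate, let us write the Taylor approximation
of our loss function when our sample is shifted by a noise vector $\varepsilon$:

$$\mathcal{L}(z + \varepsilon) = \mathcal{L}(z) + \langle J_{\mathcal{L}}(z), \varepsilon \rangle + \frac{1}{2}\, \varepsilon^T H_{\mathcal{L}}(z)\, \varepsilon + o(\varepsilon^2) \tag{11}$$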



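To match equation (10) we multiply both sides of (11) by $\mathcal{N}_{0, \sigma^2}(\varepsilon)$ and
integrate with respect to $\varepsilon$:

$$\int \mathcal{L}(z + \varepsilon)\, \mathcal{N}_{0, \sigma^2}(\varepsilon)\, d\varepsilon = \int \left[ \mathcal{L}(z) + \langle J_{\mathcal{L}}(z), \varepsilon \rangle + \frac{1}{2}\, \varepsilon^T H_{\mathcal{L}}(z)\, \varepsilon + o(\varepsilon^2) \right] \mathcal{N}_{0, \sigma^2}(\varepsilon)\, d\varepsilon \tag{12}$$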



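Equation (4) implies that all odd moments of the approximation are null, and in
conjunction with (5) we can now simplify (12) into:

$$\int \mathcal{L}(z + \varepsilon)\, \mathcal{N}_{0, \sigma^2}(\varepsilon)\, d\varepsilon = \mathcal{L}(z) \int \mathcal{N}_{0, \sigma^2}(\varepsilon)\, d\varepsilon + \int \langle J_{\mathcal{L}}(z), \varepsilon \rangle\, \mathcal{N}_{0, \sigma^2}(\varepsilon)\, d\varepsilon + \frac{1}{2} \int \varepsilon^T H_{\mathcal{L}}(z)\, \varepsilon\, \mathcal{N}_{0, \sigma^2}(\varepsilon)\, d\varepsilon + R \tag{13}$$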
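and with some algebra we can finally write:

$$\int \mathcal{L}(z + \varepsilon)\, \mathcal{N}_{0, \sigma^2}(\varepsilon)\, d\varepsilon - \mathcal{L}(z) \approx \frac{\sigma^2}{2}\, \mathrm{Tr}(H_{\mathcal{L}}(z)) \tag{14}$$

By substituting $z = z_i$ and summing over all the elements of $\mathcal{D}_n$:

$$\phi(\theta) \approx \frac{\sigma^2}{2n} \sum_i \mathrm{Tr}(H_{\mathcal{L}}(z_i)) \tag{15}$$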



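Hence training with corrupted inputs is approximately equivalent to minimizing the
following objective:

$$C_{noisy}(\theta) \approx C_{clean}(\theta) + \frac{\sigma^2}{2n} \sum_i \mathrm{Tr}(H_{\mathcal{L}}(z_i)) \tag{16}$$

This relation holds for any objective $\mathcal{L}$ under the above two assumptions on the noise.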
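The second order relation between the noise induced penalty and the trace of the Hessian can be checked numerically. The sketch below is an illustration only: `loss` is a hypothetical smooth test function, the Hessian trace is estimated with central finite differences, and the expectation over the Gaussian noise is estimated with antithetic pairs $(+\varepsilon, -\varepsilon)$ so that the first order term cancels exactly.

```python
import math
import random

def loss(z):
    """Hypothetical smooth test function of a 2-dimensional point z."""
    return math.sin(z[0]) + z[0] * z[1] + z[1] ** 2

def hessian_trace(f, z, h=1e-3):
    """Central finite-difference estimate of Tr(H_f(z))."""
    f0 = f(z)
    trace = 0.0
    for k in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[k] += h
        z_minus[k] -= h
        trace += (f(z_plus) - 2.0 * f0 + f(z_minus)) / h ** 2
    return trace

def noise_penalty_mc(f, z, sigma, n_pairs=20000, seed=0):
    """Monte Carlo estimate of E[f(z + eps)] - f(z) for isotropic Gaussian eps."""
    rng = random.Random(seed)
    f0 = f(z)
    acc = 0.0
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, sigma) for _ in z]
        z_plus = [a + e for a, e in zip(z, eps)]
        z_minus = [a - e for a, e in zip(z, eps)]
        acc += 0.5 * (f(z_plus) + f(z_minus)) - f0
    return acc / n_pairs

z = [0.3, -0.7]
sigma = 0.1
penalty_mc = noise_penalty_mc(loss, z, sigma)
penalty_taylor = 0.5 * sigma ** 2 * hessian_trace(loss, z)
```

With a small $\sigma$ the two quantities should agree up to terms of order $\sigma^4$ and the sampling error. The agreement is expected to degrade as $\sigma$ grows, which is where the higher order terms start to matter.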



All the above results are already well established in the literature on noise injection
(Bishop 1995, An 1996). Our contribution is to show that, for the second order Taylor
expansion, adding noise to the input of a well chosen regularized objective is equivalent
to adding an L2-norm penalty on the Hessian of the output function of the considered model
with respect to its input, which is not the case when adding noise to a non-regularized
objective. In the following sections we will consider the MSE as the objective function to
tune the parameters of our model, but this choice does not affect the generality of our
analysis.



4 Non regularized objective 

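We define the following error function:

$$\mathcal{L}_{clean}(z_i, \theta) = \| F(x_i, \theta) - y_i \|^2$$

and the Hessian of the cost function being the average over the Hessian of all individual
errors:

$$H_{C_{clean}}(\theta) = \frac{1}{n} \sum_i H_{\mathcal{L}_{clean}}(z_i, \theta)$$

Assuming without loss of generality that $F$ is a scalar function¹ and that the noise is
added to the input $x$, we will only consider the Hessian of the loss function with respect
to $x$:

$$\begin{aligned}
H_{\mathcal{L}_{clean}}(x) &= \frac{\partial}{\partial x} \left[ \frac{\partial}{\partial x} \left( F(x, \theta) - y \right)^2 \right] \\
&= \frac{\partial}{\partial x} \left[ 2\, \frac{\partial F(x, \theta)}{\partial x} \left( F(x, \theta) - y \right) \right] \\
&= 2 \left[ \frac{\partial^2 F(x, \theta)}{(\partial x)^2} \left( F(x, \theta) - y \right) + \left\langle \frac{\partial F(x, \theta)}{\partial x}^T, \frac{\partial F(x, \theta)}{\partial x} \right\rangle \right] \\
&= 2 H_F(x) \left( F(x, \theta) - y \right) + 2 \left\langle J_F(x)^T, J_F(x) \right\rangle
\end{aligned}$$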

A standard result of linear algebra is the relation between the Frobenius norm and the
trace operator:

$$\mathrm{Tr}(\langle A^T, A \rangle) = \|A\|_F^2$$

By taking the trace of the above results, we get:

$$\mathrm{Tr}(H_{\mathcal{L}_{clean}}(x)) = 2 \left( F(x, \theta) - y \right) \mathrm{Tr}(H_F(x)) + 2 \|J_F(x)\|_F^2$$



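and plugging this in (16) gives us the following second order approximation of the noisy
objective:

$$\mathcal{L}_{noisy}(z_i, \theta) \approx \| F(x_i, \theta) - y_i \|^2 + \sigma^2 \left[ \left( F(x, \theta) - y \right) \mathrm{Tr}(H_F(x)) + \|J_F(x)\|_F^2 \right] \tag{17}$$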



¹ In the case of multiple regression such as a reconstruction function, only the orders of
the tensors involved in the approximation of $\mathcal{L}$ will change.

We obtain an L2-norm penalty on the gradient of the mapping function $F$ added to our error
function, whereas the Hessian term is not constrained to be positive and its contributions
are not guaranteed to cancel out. For example, if the prediction overshoots or undershoots
on average, the penalty may encourage a very large Hessian trace, inducing high curvature
that could harm stochastic gradient descent and make it converge to a poor local
minimum.
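The sign issue can be made concrete on a very small model. The sketch below assumes a single hyperbolic tangent unit $F(x) = \tanh(\langle w, x \rangle + b)$, chosen only because its Jacobian and Hessian have closed forms; the weights, input and targets are arbitrary illustrative values. It splits the second order penalty $\frac{\sigma^2}{2}\mathrm{Tr}(H_{\mathcal{L}})$ into its Hessian part and its Jacobian part.

```python
import math

def penalty_terms(w, b, x, y, sigma):
    """Split the second order noise penalty for F(x) = tanh(<w, x> + b)
    into (Hessian part, Jacobian part)."""
    f = math.tanh(sum(wk * xk for wk, xk in zip(w, x)) + b)
    w_sq = sum(wk * wk for wk in w)
    jac_sq = (1.0 - f * f) ** 2 * w_sq           # ||J_F(x)||^2, never negative
    hess_tr = -2.0 * f * (1.0 - f * f) * w_sq    # Tr(H_F(x)), can take either sign
    hessian_part = sigma ** 2 * (f - y) * hess_tr
    jacobian_part = sigma ** 2 * jac_sq
    return hessian_part, jacobian_part

w, b, x, sigma = [1.0, -2.0], 0.0, [0.5, 0.1], 0.1

# Prediction below the target (undershoot): the Hessian part is positive here.
h_under, j_under = penalty_terms(w, b, x, 1.0, sigma)
# Prediction above the target (overshoot): the Hessian part becomes negative.
h_over, j_over = penalty_terms(w, b, x, -1.0, sigma)
```

In the second case the Hessian part lowers the objective, so minimizing it rewards a larger curvature instead of penalizing it, while the Jacobian part is identical and non-negative in both cases.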



5 Our regularized objective 

As the results of the previous section suggest, adding noise to the input of the objective
function yields an undesirable term that might interfere with the goal of smoothing the
output of our function. We propose here to overcome this difficulty by adding noise only
to the input of the Jacobian $J_F$ that appears in the objective function; doing so avoids
the unpredictable effect of additional unwanted terms. We take the error function we
previously used and add to it a regularization term that is the L2-norm of the Jacobian of
$F$ with respect to $x$:

$$\mathcal{L}_{reg+noise}(z_i, \theta) = \| F(x_i, \theta) - y_i \|^2 + \lambda \|J_F(x_i)\|_F^2$$
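As an illustration of how this objective can be evaluated for one training example, the sketch below keeps the squared error on the clean input and evaluates the Jacobian penalty on a corrupted copy of the input. It assumes a single hyperbolic tangent unit as the model, so that the Jacobian norm has a closed form; the function names and numerical values are hypothetical.

```python
import math
import random

def model(w, b, x):
    """Toy scalar model F(x) = tanh(<w, x> + b)."""
    return math.tanh(sum(wk * xk for wk, xk in zip(w, x)) + b)

def jacobian_sq_norm(w, b, x):
    """Squared L2-norm of the Jacobian of the toy model with respect to x."""
    f = model(w, b, x)
    return (1.0 - f * f) ** 2 * sum(wk * wk for wk in w)

def reg_noise_loss(w, b, x, y, lam, sigma, rng):
    """Squared error on the clean input plus lam times the squared Jacobian
    norm evaluated on a Gaussian-corrupted input."""
    x_noisy = [xk + rng.gauss(0.0, sigma) for xk in x]
    return (model(w, b, x) - y) ** 2 + lam * jacobian_sq_norm(w, b, x_noisy)

rng = random.Random(0)
w, b, x, y = [1.0, -2.0], 0.0, [0.5, 0.1], 1.0
value = reg_noise_loss(w, b, x, y, lam=0.5, sigma=0.1, rng=rng)
```

A fresh noise vector is drawn at every call, which mirrors stochastic training where each presentation of an example sees a different corruption.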

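We now calculate the Hessian of $\mathcal{L}$ by omitting the first term, since the noise is
added only to the input of the regularization term, and using the approximation derived in
(16), with $\tilde{x}_i = x_i + \varepsilon$ denoting the corrupted input:

$$\left\| \frac{\partial F(\tilde{x}_i, \theta)}{\partial x} \right\|^2 \approx \left\| \frac{\partial F(x_i, \theta)}{\partial x} \right\|^2 + \sigma^2\, \mathrm{Tr}\left( H_{\| \partial F / \partial x \|^2}(x) \right)$$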



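We can now calculate the Hessian of the regularization term as being:

$$\begin{aligned}
H_{\| \partial F / \partial x \|^2}(x) &= \frac{\partial}{\partial x} \left[ \frac{\partial}{\partial x} \left\| \frac{\partial F(x, \theta)}{\partial x} \right\|^2 \right] \\
&= \frac{\partial}{\partial x} \left[ 2\, \frac{\partial^2 F(x, \theta)}{(\partial x)^2}\, \frac{\partial F(x, \theta)}{\partial x} \right] \\
&= 2 \left[ \frac{\partial^3 F(x, \theta)}{(\partial x)^3}\, \frac{\partial F(x, \theta)}{\partial x} + \left\langle \frac{\partial^2 F(x, \theta)}{(\partial x)^2}^T, \frac{\partial^2 F(x, \theta)}{(\partial x)^2} \right\rangle \right]
\end{aligned}$$

$$\begin{aligned}
\mathrm{Tr}\left( H_{\| \partial F / \partial x \|^2}(x) \right) &= 2\, \mathrm{Tr}\left( \frac{\partial^3 F(x, \theta)}{(\partial x)^3}\, J_F(x) \right) + 2\, \mathrm{Tr}\left( \left\langle H_F(x)^T, H_F(x) \right\rangle \right) \\
&= 2\, \mathrm{Tr}\left( \frac{\partial^3 F(x, \theta)}{(\partial x)^3}\, J_F(x) \right) + 2 \|H_F(x)\|_F^2
\end{aligned} \tag{18}$$

which gives us the approximation of our regularized objective:

$$\mathcal{L}_{reg+noise}(z_i, \theta) \approx \| F(x_i, \theta) - y_i \|^2 + \lambda \|J_F(x)\|_F^2 + 2 \lambda \sigma^2 \|H_F(x)\|_F^2 + R \tag{19}$$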



We have shown that adding noise to a well chosen regularized objective penalizes
the L2-norm of the Hessian of the considered model (without ever calculating it), using
a second order Taylor approximation of the noisy objective under two necessary
assumptions on the noise distribution. In statistics, regularizing the norm of the
derivatives of the model to be tuned is often referred to as a roughness penalty
(Green, 1993) and is used in the context of cubic splines (De Boor, 1998).



6 Higher order terms of the Taylor expansion 



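In this section we are interested in the higher order terms of the cost approximation. We
find it convenient to use the following formalism:

if $T^n_{\mathcal{L}}(z, \varepsilon)$ denotes the n-th order derivative of $\mathcal{L}$ with respect to $z$,
applied to the noise vector $\varepsilon$, then:

$$T^n_{\mathcal{L}}(z, \varepsilon) = \sum_{i_1, \ldots, i_n} T^n_{i_1, \ldots, i_n}\, \varepsilon_{i_1} \cdots \varepsilon_{i_n}$$

where $T^n$ is a tensor of order $n$ and

$$T^n_{i_1, \ldots, i_n} = \frac{\partial^n \mathcal{L}(z)}{\partial z_{i_1} \cdots \partial z_{i_n}}$$

Using this formalism we can write the fourth order term of the Taylor expansion as being:

$$\frac{1}{24}\, T^4_{\mathcal{L}}(z, \varepsilon) = \frac{1}{24} \sum_{i_1, \ldots, i_4} T^4_{i_1, \ldots, i_4}\, \varepsilon_{i_1} \varepsilon_{i_2} \varepsilon_{i_3} \varepsilon_{i_4}$$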

Using the two assumptions made on the noise distribution, we know that the third order
term of the approximation is zero. As for the fourth order term, using the second
assumption on the noise distribution we know that only the terms that are on the
diagonal of the tensor will be non-zero. We can then write:




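Using the above result we can approximate our cost function in the noisy input setting
more finely. For this purpose we will use the results obtained above for the Hessian and
differentiate them again twice with respect to $x$:

$$\begin{aligned}
\frac{\partial^2 H_{\mathcal{L}}(x)}{(\partial x)^2} &= \frac{\partial^2}{(\partial x)^2} \left[ 2 H_F(x) \left( F(x, \theta) - y \right) + 2 \left\langle J_F(x)^T, J_F(x) \right\rangle \right] \\
&= 2\, \frac{\partial}{\partial x} \left[ T^3_F(x) \left( F(x, \theta) - y \right) + 3 H_F(x) J_F(x) \right] \\
&= 2 \left[ T^4_F(x) \left( F(x, \theta) - y \right) + 4\, T^3_F(x) J_F(x) + 3 \left\langle H_F(x)^T, H_F(x) \right\rangle \right]
\end{aligned}$$

hence,

$$\mathrm{Tr}(T^4_{\mathcal{L}}(z)) = 6 \|H_F(x)\|_F^2 + \mathrm{Tr}(R)$$

where $R = 2\, T^4_F(x) \left( F(x, \theta) - y \right) + 8\, T^3_F(x) J_F(x)$.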



7 Comparison 

7.1 Noise added to the input of the objective function 

Now that we have a higher order approximation in the case where noise is added to the
input of the function, we can compare the magnitude of the coefficients that penalize
the Hessian $H_F$. Note that in this case the Hessian term appears at the fourth order
of the Taylor expansion of the cost function, whereas we need only a second order
approximation in the case where we add a regularization term evaluated on a corrupted
input. We can write the approximation of the cost function without regularization at
the fourth order as being:

$$\mathcal{L}_{noisy}(z) \approx \mathcal{L}_{clean}(z) + \overline{\frac{\sigma^2}{2}\, \mathrm{Tr}(H_{\mathcal{L}_{clean}}(z))} + \overline{\overline{\frac{\sigma^4}{24}\, \mathrm{Tr}(T^4_{\mathcal{L}_{clean}}(z))}}$$

$$\mathcal{L}_{noisy}(z_i, \theta) \approx \| F(x_i, \theta) - y_i \|^2 + \overline{\sigma^2 \|J_F(x)\|_F^2} + \overline{\overline{\frac{\sigma^4}{4} \|H_F(x)\|_F^2}} + R \tag{20}$$

where the number of overlines identifies the order of the Taylor expansion at which each
noise induced term is obtained (one overline for the second order, two for the fourth order).

7.2 Noise added to the input of the Jacobian of the objective function 

In this case we just need to approximate the cost function up to the second order of the
Taylor expansion:

$$\mathcal{L}_{reg+noise}(z_i, \theta) \approx \| F(x_i, \theta) - y_i \|^2 + \lambda \|J_F(x)\|_F^2 + 2 \lambda \sigma^2 \|H_F(x)\|_F^2 + R \tag{21}$$



8 Experimental results 



We have run several experiments in order to benchmark the effect of regularization
and noise combined. For this task we used the well known MNIST (LeCun et al. 1998),
binarized MNIST and USPS databases. Surprisingly, we were able to achieve results
close to those obtained with unsupervised pretraining (Erhan et al, 2010). MNIST is
composed of 70k handwritten digits, each represented by a vector of pixels. It is divided
into 50k examples for the training set and 10k for each of the validation and test sets;
the features were rescaled to lie within [0,1]. MNIST-BINARY is divided exactly the same
way as MNIST, the only difference being that the pixels whose intensity exceeds a fixed
threshold were set to 1 and the others to 0. The USPS dataset consists of a training set
with 7291 images and a test set with 2007 images; the validation set was formed using the
last 2000 images of the training set. The model $F(x)$, with parameters
$\theta = \{W^{(1)}, \ldots, W^{(n)}, b^{(1)}, \ldots, b^{(n)}\}$, that we considered to solve the
classification task was a standard neural network with one or more hidden layers and a
hyperbolic tangent non-linearity in between the layers. For example, with one hidden layer
we have:

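$$F(x) = \sigma\left( W^{(2)} \tanh\left( W^{(1)} x + b^{(1)} \right) + b^{(2)} \right)$$

where $\sigma(\cdot)$ is the logistic sigmoid function. In this setting we can write the Frobenius
norm of the Jacobian of $F$ with respect to $x$ as being:

$$\|J_F(x)\|_F^2 = \sum_{i,j} J_{ij}^2$$

and with some calculus we get:

$$J_{ij} = F_i(x) \left( 1 - F_i(x) \right) \sum_l W^{(2)}_{il} W^{(1)}_{lj} \left( 1 - \tanh^2\left( \sum_m W^{(1)}_{lm} x_m + b^{(1)}_l \right) \right)$$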

where $F_i(x)$ is the i-th output of the network. For the results in Table 1, we used a
number of hidden units ranging from 400 to 1000 per layer. The best results were obtained
with two hidden layers on MNIST, and one hidden layer on MNIST-BINARY and USPS.
The parameters of the model were learned through stochastic gradient descent with a
learning rate ranging from 0.1 to 0.001. We also investigated the use of rectifying units
(i.e. max(0,x)) (Nair and Hinton 2010, Glorot et al., 2010) as the non-linearity in the
hidden layer. Surprisingly, they seemed to benefit less from the noise added to the input
than from the regularization term alone. They achieved a test classification error
of 4.8% on the USPS dataset, equaling the best performance of the hyperbolic tangent
activation with both regularization and noise added to the input. The best results were
obtained with an isotropic Gaussian noise with a standard deviation of 0.1 around training
samples.

Figure 1 shows the histograms of activation values on the MNIST test set of our best
MLPs with and without Jacobian regularization. The activations of the regularized
model are more densely distributed at the saturation and linear regime.
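The closed form of the Jacobian for the one hidden layer network is cheap to evaluate because it reuses the hidden activations and the outputs of the forward pass. The sketch below is a plain Python illustration with small hypothetical weight matrices (not the ones used in our experiments); it computes the Jacobian entries and the squared Frobenius norm used as the penalty.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(W1, b1, W2, b2, x):
    """Return (hidden activations, outputs) of sigmoid(W2 tanh(W1 x + b1) + b2)."""
    hidden = [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bl)
              for row, bl in zip(W1, b1)]
    out = [sigmoid(sum(w * hl for w, hl in zip(row, hidden)) + bi)
           for row, bi in zip(W2, b2)]
    return hidden, out

def jacobian(W1, b1, W2, b2, x):
    """Analytic Jacobian J[i][j] = dF_i / dx_j of the network output."""
    hidden, out = forward(W1, b1, W2, b2, x)
    J = []
    for i, f_i in enumerate(out):
        row = []
        for j in range(len(x)):
            s = sum(W2[i][l] * (1.0 - hidden[l] ** 2) * W1[l][j]
                    for l in range(len(hidden)))
            row.append(f_i * (1.0 - f_i) * s)
        J.append(row)
    return J

def frobenius_sq(J):
    """Squared Frobenius norm, i.e. the sum of the squared Jacobian entries."""
    return sum(v * v for row in J for v in row)

# Hypothetical parameters: 2 inputs, 3 hidden units, 2 outputs.
W1 = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.4, -0.6, 0.9], [-0.2, 0.3, 0.5]]
b2 = [0.05, -0.05]
x = [0.2, -0.4]

J = jacobian(W1, b1, W2, b2, x)
penalty = frobenius_sq(J)
```

A simple way to validate such an implementation is to compare each entry of `J` with a central finite difference of the forward pass.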



9 Discussion 

9.1 Constraining the solution space 

When optimizing a non-convex function with an infinite number of local minima, it
is not clear which of them yields a reasonable generalization performance; the concept
of overfitting clearly illustrates this point. The proposed regularizer tries to avoid this
scenario by flattening the mapping function over the training examples, inducing a local
invariance in the mapping space to infinitesimal variations in the input space. Figure 2
shows that when the input is corrupted, the models learned with the regularization term
are more robust to noise on the input. Geometrically, the added regularization constrains
the model to be a Lipschitz function or a contracting map around the training examples,
imposing the constraint $F(x + \varepsilon) \approx F(x)$.

Figure 1: Normalized histograms of the activation values (horizontal axis: activation value,
from -1 to 1) on the MNIST test set for the best MLPs with and without Jacobian
regularization. The activations of the regularized model are more densely distributed at
the saturation and linear regime.

9.2 Smoothing away from the training points 

Penalizing only the Jacobian of the model $F$ with respect to the input $x$ only guarantees
flatness for an infinitesimal move away from the training example. To illustrate
this point, one can imagine that on the tip of a Dirac function the norm of the Jacobian
is null, while it is infinite just around it. Although this exact situation is not possible
in the context of neural networks because of their smooth activation functions, given
enough capacity we could converge towards this solution if we do not add additional
constraints. One of them would be to adjust the locality of the flatness as a
hyper-parameter of the model. This requires computing the higher order terms of the mapping
function in order to regularize the magnitude of their norms. As discussed before, it is
computationally expensive to explicitly calculate the norm of the high order derivatives
because their number of components increases exponentially with the order. Instead, our
approach proposes an approximation of the Hessian term that allows one to control the
magnitudes of the Jacobian and Hessian norms independently.

Table 1: Test error

Model          MNIST    MNIST-BINARY    USPS
MLP            1.82%    2.01%           5.74%
MLP + L2       1.66%    2.20%           5.6%
MLP + Noise    1.33%    1.7%            4.85%
MLP + Jacob    1.31%    1.65%           4.85%
MLP + N + J    1.19%    1.51%           4.6%

9.3 The other terms induced by the noise 

As we have seen in equation (17), the added noise does not yield only a penalty
on the norm of the successive derivatives of the mapping function, and it is somewhat
unclear how the other terms behave during gradient descent since they are not constrained
to be positive. In a supervised setting, it is empirically feasible to drive those terms
to zero because of the small dimensionality of the target points, whereas in a
multi-dimensional regression task such as the reconstruction objective of an auto-encoder
it is often impossible to achieve a "near" zero minimization of the cost with a first order
optimization method such as stochastic gradient descent. The reader should note that the
approximation of the noisy cost is valid when the number of corrupted inputs tends
to infinity, though in practice this is never the case. It would be interesting to
estimate the difference between the terms induced by the noise and their true values
as a function of the number of corrupted samples.

10 Conclusion 

We have shown how to obtain a better generalization performance using a regularization
term that adds a marginal computational overhead compared to the traditional approach.
Using a Taylor expansion of the cost function, we also showed that by adding noise to
the input of the regularization term we are able to penalize with a greater magnitude the
norm of the higher order derivatives of the model, avoiding the need to explicitly calculate
them, which would be computationally prohibitive. Initial results suggest that
different parametric models benefit from this approach in terms of predicting
out-of-sample points. It would be interesting to investigate how this regularization approach
would behave when used with non-parametric models such as Gaussian mixtures.

Figure 2: Robustness of the generalization error with respect to a Gaussian corruption
noise added to the input. The model trained with the combination of input noise and
Jacobian regularization is more robust.